fix(reborn): remove per-record lock convoys via shared cas_update helper#5234
fix(reborn): remove per-record lock convoys via shared cas_update helper#5234henrypark133 wants to merge 16 commits into
Conversation
Extract the proven mutex-free CAS pattern from ironclaw_turns (PR #5142) into one shared helper in ironclaw_filesystem so the remaining stores can drop their per-record tokio::sync::Mutex convoys. cas_update runs a bounded read-modify-write loop: read versioned snapshot, run the caller's idempotent apply closure, CAS-put with the read version, and on VersionMismatch re-read and retry with jittered exponential backoff (2ms..50ms), capped at 32 retries and wrapped in a 15s timeout. A fail-closed capability gate rejects backends that cannot CAS rather than silently blind-overwriting. The helper is generic over the record type, the outcome, and the caller's error; it never leaks store-specific types. No store migrated yet — that follows in the next commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te helper ironclaw_resources had NO CAS-retry loop — its per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) held across the backend get/put awaits was the ONLY serializer. Under burst, one writer stalled inside its critical section parked every other same-scope writer (the convoy that contributed to the 2026-06-24 runtime wedge). Naively deleting that mutex without a retry loop loses updates: a racing writer's single-attempt CAS detects the version mismatch and errors out (cross-process CAS contention) instead of retrying. Route update_snapshot through ironclaw_filesystem::cas_update (bounded CAS retry + jittered backoff + 15s timeout + fail-closed capability gate) and delete the per-record mutex, FilesystemRecordLock, the local PutError, and the fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The helper fails closed instead; verified safe — these store aliases only ever resolve to CAS-capable db/in-memory backends in production. The public update API widens from FnOnce to FnMut because the helper re-runs the closure against a freshly read snapshot on every CAS retry; leaf closures that previously moved captured values now clone per invocation. Add PartialEq to BudgetGateSnapshot (helper needs S: Clone + PartialEq). Red->green regression: cas_snapshot::tests::concurrent_increments_have_no_lost_updates proves RED on lock-removed-no-retry (lost updates / spurious contention) and GREEN through the helper (every concurrent increment lands, no convoy). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ant per-record mutex run_state held BOTH a store-local CAS-retry loop AND a per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) across the backend get/put awaits — belt-and-suspenders where the mutex was a redundant in-process serializer over a backend that already does versioned CAS. Under burst the mutex convoyed same-scope writers behind one stalled writer (the 2026-06-24 wedge pattern). Route apply_update, update_status, start, and save_pending through ironclaw_filesystem::cas_update (one shared CAS implementation: bounded retry + jittered backoff + 15s timeout + fail-closed capability gate). Delete the store-local FILESYSTEM_CAS_RETRIES loop, the per-record lock + accessor + its two unit tests, the local PutError, and the fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The scope-ownership check and the approval Pending guard move inside the re-runnable apply closure; the ApprovalStatus guard returns Err from apply. discard_pending keeps its read-then-delete, only the lock is dropped. The run_state contract tests previously ran against LocalFilesystem (byte-only, no versioned CAS) where the OLD fail-open fallback masked the missing CAS. Production never routes run_state to LocalFilesystem — only to CAS-capable db/in-memory backends — so the tests now use InMemoryBackend, matching the production capability shape (CAS) instead of a non-production blind-overwrite path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p per-record mutex ensure_thread held the only per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) in this crate, across the backend get/put awaits, to serialize the check-then-create against Unsupported(WriteFile)-fallback backends. Route it through ironclaw_filesystem::cas_update: the apply closure returns the existing record unchanged (no-op, no write) when the thread already exists with a matching scope, or builds the fresh StoredThreadRecord when absent. A concurrent create-if-absent loser hits VersionMismatch, the helper re-reads and re-runs apply which now sees the winner and reconciles scope — exactly the old single-reconcile semantics, now via the shared CAS-retry loop without any lock. Delete the lock infra (Weak map + accessor); add PartialEq to StoredThreadRecord (helper needs S: Clone + PartialEq). The three other record/txn loops were already lock-free and are left as-is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord mutex (Arc leak fix) The secrets filesystem store kept FILESYSTEM_RECORD_LOCKS as a HashMap of *strong* Arc<Mutex<()>> (not Weak), one entry per (scope,lease)/session path ever seen, NEVER pruned — an unbounded memory leak. The per-record mutex was also held across the backend get/put awaits, convoying same-key writers behind one stalled writer (the 2026-06-24 wedge pattern). Route consume/revoke/consume_session_use through ironclaw_filesystem::cas_update (one shared CAS implementation), mapping the local CasDecision onto CasApply: Commit/BestEffortCommit -> changed snapshot; Settle(Ok) -> unchanged snapshot (helper skips the write via PartialEq no-op); Settle(Err)/not-found -> apply error. Crypto decrypt + use-count increment run inside the re-runnable apply closure (pure / recomputed from the freshly read record each retry). validate_session is a pure read — its lock is simply deleted. Delete the entire lock map (leak gone), cas_mutate/CasDecision/CAS_RETRY_ATTEMPTS, and the fail-open put_with_version_fallback (helper fails closed). Add PartialEq to StoredLease and StoredSession. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR #5142 removed the per-record mutex from ironclaw_turns but left a local copy of the CAS read-modify-write loop (apply_with_retry, put_with_cas, PutError, cas_retry_backoff + constants). Re-home it onto ironclaw_filesystem::cas_update so there is ONE CAS implementation across the codebase. Behavior is preserved: the 18 call sites and their closure shape (FnMut(InMemoryTurnStateStore) -> Fut) are unchanged; the exact timeout/exhaustion error strings are preserved. A BridgeError<T> sentinel carries the absent-record + default-snapshot no-op through the helper's apply-error channel (the helper's own no-op check only fires for Some(existing)==new; the turns store additionally must skip creating a file for an empty default store, which the old new==old check covered because read returned default on absent). The 500ms SNAPSHOT_READ_CACHE_TTL read cache is kept as a separate layer: cleared around the CAS call and repopulated on the next read (the helper does its own fresh get and does not surface the new RecordVersion; every read_snapshot caller already discards the version, so caching None is safe). Delete the now-unused local loop, put_with_cas, PutError, cas_retry_backoff, and the duplicate CAS constants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord mutex Add the invariant to crates/ironclaw_filesystem/CLAUDE.md (#2 "CAS is the floor") and a pointer in .claude/rules/database.md: every filesystem read-modify-write must go through the one shared ironclaw_filesystem::cas_update helper; never wrap it in a per-record tokio::sync::Mutex held across the backend .await (redundant serializer + convoy/wedge risk + leak). Also commit Cargo.lock (async-trait dev-dep added to ironclaw_resources for the convoy regression test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request migrates multiple persistence stores (ironclaw_resources, ironclaw_run_state, ironclaw_threads, ironclaw_secrets, and ironclaw_turns) to use a shared, lock-free compare-and-swap (CAS) helper (cas_update). This eliminates redundant per-record mutexes held across .await boundaries, preventing runtime-wedging convoys under high contention. Feedback on the changes highlights an opportunity to improve the jitter calculation in cas_retry_backoff using RandomState to avoid zero-jitter scenarios in low clock resolution environments (such as VMs or Windows). Additionally, it is recommended to explicitly add Send bounds to the closure generic constraints in cas_update and cas_update_loop to catch any non-Send regressions at the helper's definition site.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| async fn cas_retry_backoff(attempt: usize) { | ||
| let shift = attempt.min(8) as u32; | ||
| let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX); | ||
| let base_delay = FILESYSTEM_CAS_BACKOFF_BASE | ||
| .saturating_mul(multiplier) | ||
| .min(FILESYSTEM_CAS_BACKOFF_MAX); | ||
| let jitter = SystemTime::now() | ||
| .duration_since(UNIX_EPOCH) | ||
| .map(|elapsed| { | ||
| let jitter_ceiling = base_delay.as_millis().max(1); | ||
| Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64) | ||
| }) | ||
| .unwrap_or_default(); | ||
| tokio::time::sleep(base_delay.saturating_add(jitter)).await; | ||
| } |
There was a problem hiding this comment.
Using SystemTime::now() for jitter calculation can lead to zero jitter on systems with low clock resolution (e.g., 1ms resolution in VMs, containers, or Windows). Since base_delay.as_millis() is at most 50, and 1,000,000 (1ms in nanoseconds) is divisible by all common backoff ceilings (2, 4, 8, 16, 32, 50), the modulo operation elapsed.as_nanos() % jitter_ceiling will always evaluate to exactly 0. This completely defeats the purpose of jitter under high contention.
Instead, use std::collections::hash_map::RandomState to obtain a high-quality, thread-safe, and un-correlated pseudo-random value seeded from the OS without adding external dependencies.
| async fn cas_retry_backoff(attempt: usize) { | |
| let shift = attempt.min(8) as u32; | |
| let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX); | |
| let base_delay = FILESYSTEM_CAS_BACKOFF_BASE | |
| .saturating_mul(multiplier) | |
| .min(FILESYSTEM_CAS_BACKOFF_MAX); | |
| let jitter = SystemTime::now() | |
| .duration_since(UNIX_EPOCH) | |
| .map(|elapsed| { | |
| let jitter_ceiling = base_delay.as_millis().max(1); | |
| Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64) | |
| }) | |
| .unwrap_or_default(); | |
| tokio::time::sleep(base_delay.saturating_add(jitter)).await; | |
| } | |
| async fn cas_retry_backoff(attempt: usize) { | |
| let shift = attempt.min(8) as u32; | |
| let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX); | |
| let base_delay = FILESYSTEM_CAS_BACKOFF_BASE | |
| .saturating_mul(multiplier) | |
| .min(FILESYSTEM_CAS_BACKOFF_MAX); | |
| let jitter = { | |
| use std::collections::hash_map::RandomState; | |
| use std::hash::{BuildHasher, Hash, Hasher}; | |
| let mut hasher = RandomState::new().build_hasher(); | |
| attempt.hash(&mut hasher); | |
| let hash = hasher.finish(); | |
| let jitter_ceiling = base_delay.as_millis().max(1) as u64; | |
| Duration::from_millis(hash % jitter_ceiling) | |
| }; | |
| tokio::time::sleep(base_delay.saturating_add(jitter)).await; | |
| } |
There was a problem hiding this comment.
Fixed in 0eaf69d — jitter now derives from a RandomState-seeded hash (clock-independent), so it no longer collapses to 0 on coarse-resolution clocks. See cas_retry_backoff in crates/ironclaw_filesystem/src/cas.rs.
| pub async fn cas_update<F, S, T, E, D, N, A, Fut>( | ||
| filesystem: &ScopedFilesystem<F>, | ||
| scope: &ResourceScope, | ||
| path: &ScopedPath, | ||
| decode: D, | ||
| encode: N, | ||
| mut apply: A, | ||
| ) -> Result<T, CasUpdateError<E>> | ||
| where | ||
| F: RootFilesystem + ?Sized, | ||
| S: PartialEq + Clone, | ||
| D: Fn(&[u8]) -> Result<S, E>, | ||
| N: Fn(&S) -> Result<Entry, E>, | ||
| A: FnMut(Option<S>) -> Fut, | ||
| Fut: Future<Output = Result<CasApply<S, T>, E>>, |
There was a problem hiding this comment.
According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.
pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
filesystem: &ScopedFilesystem<F>,
scope: &ResourceScope,
path: &ScopedPath,
decode: D,
encode: N,
mut apply: A,
) -> Result<T, CasUpdateError<E>>
where
F: RootFilesystem + ?Sized,
S: PartialEq + Clone + Send,
T: Send,
E: Send,
D: Fn(&[u8]) -> Result<S, E> + Send,
N: Fn(&S) -> Result<Entry, E> + Send,
A: FnMut(Option<S>) -> Fut + Send,
Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,References
- In Rust, explicitly add
Sendbounds to closure generic constraints in async helper functions that return futures intended to beSend. This ensures that non-Sendregressions are caught at the helper's definition site rather than at the call site, even if the future is already bound bySendat the call site.
There was a problem hiding this comment.
Considered but declined. Adding Send to the helper generics forces propagating Send across consumer-crate generics (run_state mutate: M, turns persistence T) — threading a bound through multiple layers for marginal "catch at the def site" benefit, which .claude/rules/architecture.md flags as bound-creep. The futures are already Send-constrained where awaited/spawned.
| async fn cas_update_loop<F, S, T, E, D, N, A, Fut>( | ||
| filesystem: &ScopedFilesystem<F>, | ||
| scope: &ResourceScope, | ||
| path: &ScopedPath, | ||
| decode: &D, | ||
| encode: &N, | ||
| apply: &mut A, | ||
| ) -> Result<T, CasUpdateError<E>> | ||
| where | ||
| F: RootFilesystem + ?Sized, | ||
| S: PartialEq + Clone, | ||
| D: Fn(&[u8]) -> Result<S, E>, | ||
| N: Fn(&S) -> Result<Entry, E>, | ||
| A: FnMut(Option<S>) -> Fut, | ||
| Fut: Future<Output = Result<CasApply<S, T>, E>>, |
There was a problem hiding this comment.
According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.
async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
filesystem: &ScopedFilesystem<F>,
scope: &ResourceScope,
path: &ScopedPath,
decode: &D,
encode: &N,
apply: &mut A,
) -> Result<T, CasUpdateError<E>>
where
F: RootFilesystem + ?Sized,
S: PartialEq + Clone + Send,
T: Send,
E: Send,
D: Fn(&[u8]) -> Result<S, E> + Send,
N: Fn(&S) -> Result<Entry, E> + Send,
A: FnMut(Option<S>) -> Fut + Send,
Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,References
- In Rust, explicitly add
Sendbounds to closure generic constraints in async helper functions that return futures intended to beSend. This ensures that non-Sendregressions are caught at the helper's definition site rather than at the call site, even if the future is already bound bySendat the call site.
There was a problem hiding this comment.
Same decision as the cas_update bound above — declined to avoid propagating Send across consumer-crate generics for marginal benefit.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughFilesystem-backed read-modify-write paths now use ChangesShared filesystem CAS migration
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
ironclaw/crates/ironclaw_run_state/src/lib.rs
Line 932 in 3223540
When discard_pending races in the same process with approve or deny for the same request, this unconditional delete can now run after the resolver's CAS-protected status update and remove the terminal approval record. The removed per-record lock used to serialize that get/check/delete sequence with status updates; without either that lock or a CAS/tombstone transition here, a user approval can be lost and later look like UnknownApprovalRequest rather than Approved.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/ironclaw_run_state/src/lib.rs (1)
922-932: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftMake
discard_pendingatomic before dropping the lock.This path reads
Pending, then deletes later with no CAS expectation. A concurrentapprove()/deny()can win the CAS update between Line 922 and Line 932, then this delete removes a non-pending approval record. Use a CAS-aware delete/transactional transition, or model discard as a CAS status update instead of read-then-delete. As per path instructions: “Fail loud” and filesystem read-modify-write paths must use the CAS invariant rather than split multi-step state changes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_run_state/src/lib.rs` around lines 922 - 932, The discard_pending flow in RunState::discard_pending is currently split into a read/validate step and a later filesystem delete, which can race with approve()/deny() and remove a record after its status has already changed. Update this path to be atomic by using a CAS-aware transition or transactional delete that enforces the same status invariant at the point of mutation, rather than relying on a prior Pending check; keep the failure path loud if the expected state no longer matches.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Line 193: The new Clippy allow on cas.rs needs the required arch-exempt
rationale comment immediately above #[allow(clippy::too_many_arguments)]. Update
the same site near the cas:: function signature that triggered the lint by
adding the mandatory // arch-exempt: too_many_args, <reason naming the missing
aggregation>, plan `#NNNN` annotation, and make sure the reason justifies why this
many parameters is unavoidable until the missing aggregation is introduced.
- Around line 295-306: The CAS preflight is treating
BackendCapabilities::default() as “unknown,” which lets a backend with no
advertised capabilities bypass the fail-closed gate. Update capabilities_known()
in cas.rs to only treat an explicit unknown state as unknown and not overload
BackendCapabilities::empty()/default(), then make the cas_update path rely on
that check so unsupported backends still return CasUpdateError::CasUnsupported
instead of falling back to op-time behavior.
- Around line 262-265: The no-op shortcut in cas_update_loop currently only
returns early when current is Some(existing) and matches the snapshot, while
absent records still fall through to encoding and writing; update the logic to
either treat None plus an unchanged/default snapshot as a true no-op or adjust
the CasApply/cas_update contract and surrounding comments to state the shortcut
only applies to existing records. Use the existing cas_update_loop and CasApply
symbols to keep the behavior and docs aligned, and remove any need for
downstream NoOp suppression if you choose to enforce it here.
In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 184-399: Add a regression test in cas/tests.rs for the timeout
path in cas_update, since the current tests cover retries, CAS rejection, and
apply errors but never exercise CasUpdateError::Timeout. Introduce a tiny wedged
backend or stub that blocks one of the CAS operations (get, put, or apply), and
use paused Tokio time so the outer timeout in cas_update deterministically
fires. Reference cas_update and CasUpdateError::Timeout in the new test, and
assert that the timeout branch wins over the hung backend behavior.
In `@crates/ironclaw_resources/src/cas_snapshot.rs`:
- Line 212: The CasUpdateError::Backend handling in cas_snapshot.rs currently
forwards the backend error string directly via E::storage_from(inner), which can
leak virtual paths and raw backend details into StorageError. Update this
mapping to return a stable sanitized public message from the error conversion
path, and preserve the detailed backend error only in internal diagnostics or
source context associated with E::storage_from.
In `@crates/ironclaw_run_state/src/lib.rs`:
- Line 1109: The `CasUpdateError::Backend` branch in `RunStateError` currently
converts the filesystem failure to a string, which drops the structured
`FilesystemError` context. Update this match arm in `lib.rs` to preserve the
typed error by using the existing `?`-style conversion path or direct typed
`From`/`Into` conversion used elsewhere in this file, so the `FilesystemError`
variant and its path/operation details continue to propagate through
`RunStateError::Filesystem`.
In `@crates/ironclaw_secrets/src/filesystem_store.rs`:
- Around line 479-483: The no-op CAS comment in filesystem_store.rs is
inaccurate: the `already_marked` branch in the lease update path returns
`Err(SecretStoreError::LeaseExpired { lease_id })` and goes through
`CasUpdateError::Apply`, not the unchanged record/`PartialEq` skip path. Update
the comment to describe the actual intent, or change the branch in the relevant
lease/CAS helper to return `CasApply::new(lease, Err(...))` if that is the
desired behavior; use the `already_marked`, `SecretLeaseStatus::Expired`, and
`CasUpdateError::Apply` symbols to locate and align the code with the documented
contract.
In `@crates/ironclaw_turns/src/filesystem_store.rs`:
- Around line 378-385: The CAS error mapper in map_cas_error should not panic on
BridgeError::NoOp, since this is production error-handling code. Replace the
unreachable! fallback in map_cas_error with a typed unavailable TurnError that
preserves context about the unexpected NoOp leak, while keeping the
BridgeError::Real(inner) path unchanged and consistent with CasUpdateError
handling.
In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 137-140: The quality gate in the migration plan is outdated and
should match the repo’s required backend validation for persistence changes.
Update the “Quality gate” section to include the feature-isolation `cargo check`
matrix for the default/postgres build, `--no-default-features --features
libsql`, and `--all-features`, and strengthen the clippy step to the full `cargo
clippy --all --benches --tests --examples --all-features -- -D warnings`. Keep
the existing formatting/tests coverage, but ensure the plan explicitly reflects
the required checks for the legacy persistence migration.
---
Outside diff comments:
In `@crates/ironclaw_run_state/src/lib.rs`:
- Around line 922-932: The discard_pending flow in RunState::discard_pending is
currently split into a read/validate step and a later filesystem delete, which
can race with approve()/deny() and remove a record after its status has already
changed. Update this path to be atomic by using a CAS-aware transition or
transactional delete that enforces the same status invariant at the point of
mutation, rather than relying on a prior Pending check; keep the failure path
loud if the expected state no longer matches.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 1c594037-4929-451d-97ad-73e096ed325d
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock,!**/Cargo.lock
📒 Files selected for processing (17)
.claude/rules/database.mdcrates/ironclaw_filesystem/CLAUDE.mdcrates/ironclaw_filesystem/src/cas.rscrates/ironclaw_filesystem/src/cas/tests.rscrates/ironclaw_filesystem/src/lib.rscrates/ironclaw_filesystem/src/scoped.rscrates/ironclaw_resources/Cargo.tomlcrates/ironclaw_resources/src/cas_snapshot.rscrates/ironclaw_resources/src/filesystem_store.rscrates/ironclaw_resources/src/lib.rscrates/ironclaw_run_state/src/lib.rscrates/ironclaw_run_state/tests/approval_resolution_contract.rscrates/ironclaw_run_state/tests/run_state_contract.rscrates/ironclaw_secrets/src/filesystem_store.rscrates/ironclaw_threads/src/filesystem_service.rscrates/ironclaw_turns/src/filesystem_store.rsdocs/plans/2026-06-25-cas-migration.md
|
🚅 Deployed to the ironclaw-pr-5234 environment in ironclaw-ci-preview
|
Fold in PR review (henrypark133) + a durable-write hot-path audit. Adds §12 framing the deployed lease-expiry cascade as three independent layers: - Layer A: in-process lock convoy → #5234 (open); write-behind composes on the post-#5234 CAS path. - Layer B: pool starvation — DEFAULT_POSTGRES_POOL_MAX_SIZE=2 shared across all Postgres FS I/O. Notes the existing 30s checkout guard (closes the infinite hang) but flags pool-too-small + checkout(30s)>apply(15s); cheap mitigations (raise pool, reserve a critical connection, align checkout<apply) prior to and complementary with write-behind. - Layer C: the synchronous hot-path write map (events / governor / thread-append / lease / memory) with the write-behind-vs-batch-coalesce distinction. Events are the top batch-coalesce target (highest churn, O(1) INSERT no CAS); must stay DURABLE (source of truth), per-step flush to preserve live SSE. Memory stays synchronous (FTS-only, no embedding write; agent-initiated, low churn). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve PR #5234 review comments across the shared cas_update helper and its consumers: - cas: replace SystemTime-based jitter (collapses to 0 on coarse clocks) with a RandomState-seeded value; drop the now-unnecessary too_many_arguments allow (helper has 6 args, below clippy threshold). - cas: add explicit CasApply::no_op so absent-record no-ops are signalled directly; make the contract doc true. Deletes the BridgeError::NoOp "success-through-an-error-channel" hack in ironclaw_turns (E is now TurnError directly) and simplifies map_cas_error. - run_state: fix a TOCTOU race in FilesystemApprovalRequestStore:: discard_pending (Codex P1). The non-atomic get->check->delete could remove a record an approve()/deny() CAS had just transitioned. Route discard through cas_update as a version-checked tombstone transition (new ApprovalStatus::Discarded); get/records_for_scope filter it so it still reads as gone. Adds a no-clobber regression test. - turns/secrets: correct inaccurate CAS no-op comment; replace an unreachable!() panic in map_cas_error with a typed Unavailable error. - add a deterministic CasUpdateError::Timeout regression test (paused Tokio time + hanging backend). - docs: restore the feature-isolation cargo check matrix and full clippy gate in the migration plan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/ironclaw_product_workflow/src/approval_interaction/service.rs (1)
158-164: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winClassify discarded approvals as stale before the persistent branch.
For
persistent == true,ApprovalStatus::Discardedhits Line 163 and returnsAlwaysAllowUnsupported, so the new stale arm at Line 199 is unreachable for that path.Proposed fix
- if matches!(status, ApprovalStatus::Denied | ApprovalStatus::Expired) { + if matches!( + status, + ApprovalStatus::Denied | ApprovalStatus::Expired | ApprovalStatus::Discarded + ) {As per coding guidelines, WebUI-facing facade errors must expose the stable sanitized taxonomy for blocked approval/auth/resource and stale/conflict states.
Also applies to: 199-199
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around lines 158 - 164, The approval gating in approval_interaction::service::ApprovalInteractionService should classify ApprovalStatus::Discarded as stale before the persistent-only unsupported branch. Update the early status checks around the approval_rejected flow so the stale rejection path is taken for Discarded regardless of persistent, and ensure the later stale arm remains reachable with the sanitized StaleGate taxonomy instead of falling through to AlwaysAllowUnsupported.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Around line 102-114: The no-op early returns in cas_update can return a
snapshot that is no longer current, which then gets cached by
filesystem_store::apply_from_snapshot and similar callers. Update cas_update so
the equality and explicit no_op branches either revalidate the record/version
before returning or avoid exposing a cacheable “final” snapshot from those
paths; use the cas_update, CasApply::new, and CasApply::no_op branches as the
main touchpoints. Also adjust the surrounding contract/comments so they only
promise what is actually enforced after concurrent writes.
In `@crates/ironclaw_run_state/tests/run_state_contract.rs`:
- Around line 491-510: The regression test in
filesystem_discard_does_not_clobber_resolved_approval is still sequential and
does not force the TOCTOU interleaving it describes. Update this test to use a
store/filesystem harness that can pause discard_pending after it reads a Pending
record, then let approve run on the same request before discard_pending resumes
its write/CAS path. Use FilesystemApprovalRequestStore and
discard_pending/approve to verify the resolved approval is preserved and the
terminal record is not clobbered.
In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 141-145: The quality gate currently uses cargo fmt --all, which
reformats files instead of validating CI-style formatting; update the checklist
entry in the plan to use the repo’s check-mode formatting command instead. Make
this change in the validation list alongside the other cargo check/clippy steps
so the migration plan matches the “Before You Open a PR” requirement and remains
authoritative.
---
Outside diff comments:
In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs`:
- Around line 158-164: The approval gating in
approval_interaction::service::ApprovalInteractionService should classify
ApprovalStatus::Discarded as stale before the persistent-only unsupported
branch. Update the early status checks around the approval_rejected flow so the
stale rejection path is taken for Discarded regardless of persistent, and ensure
the later stale arm remains reachable with the sanitized StaleGate taxonomy
instead of falling through to AlwaysAllowUnsupported.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: e8a5fca8-3068-48d7-92df-53d101912fa5
📒 Files selected for processing (10)
crates/ironclaw_capabilities/src/helpers.rscrates/ironclaw_filesystem/Cargo.tomlcrates/ironclaw_filesystem/src/cas.rscrates/ironclaw_filesystem/src/cas/tests.rscrates/ironclaw_product_workflow/src/approval_interaction/service.rscrates/ironclaw_run_state/src/lib.rscrates/ironclaw_run_state/tests/run_state_contract.rscrates/ironclaw_secrets/src/filesystem_store.rscrates/ironclaw_turns/src/filesystem_store.rsdocs/plans/2026-06-25-cas-migration.md
Reconcile the CAS-migration work with main, notably PR #5232 (durable runner-lease sidecar) which independently rewrote ironclaw_turns/filesystem_store.rs — splitting it into io/profile_resolver/ projection/runner_lease submodules and adding a per-run lease sidecar. Resolution of crates/ironclaw_turns/src/filesystem_store.rs: - Keep main's module split + runner-lease sidecar feature. - Keep this PR's routing of the turn-state snapshot RMW through the shared ironclaw_filesystem::cas_update helper (no inline CAS loop, no BridgeError; E is TurnError directly; CasApply::no_op for the inert-snapshot no-op). - The cas_update apply callback applies the runner-lease overlay per retry and uses the OVERLAID snapshot as the no-op baseline, matching main's pre-merge `new_snapshot == old_snapshot` semantics so an active lease overlay does not force a spurious write each apply. Covered by a new regression test (filesystem_turn_state_store_no_op_under_active_lease_overlay_does_not_rewrite_snapshot). - service.rs: add ApprovalStatus::Discarded to the early stale-gate guard for consistency with the exhaustive match arms. Known scoped exception, tracked in #5274: the runner-lease sidecar still uses a local put_with_cas + cas_retry_backoff loop rather than cas_update. Documented in crates/ironclaw_filesystem/CLAUDE.md; the main snapshot RMW does go through cas_update. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the Codex P1 (and CodeRabbit's matching outside-diff note) on Fixed in 0eaf69d: the non-atomic Regression test added: |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/ironclaw_product_workflow/src/approval_interaction/service.rs (2)
312-332: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winUse
debug!for these fallback diagnostics.These new
warn!calls are internal diagnostics on a best-effort path. In this repo,warn!/info!from runtime flows are reserved out because they corrupt the interactive REPL/TUI display. As per path instructions,REPL/TUI logging: info!/warn! corrupt the terminal UI — internal diagnostics use debug!.Suggested diff
- tracing::warn!( + tracing::debug!( error = %error, capability_id = %policy.key.capability_id, action = ?policy.key.action, "persistent approval policy write failed after approval resolution" ); @@ - tracing::warn!( + tracing::debug!( error = %error, capability_id = %override_key.capability_id, "tool permission override clear failed after persistent approval" );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around lines 312 - 332, These fallback diagnostics in approval_interaction::service, specifically the tracing calls inside the persistent_policies.allow and tool_permission_overrides.clear error paths, should use debug! instead of warn! because runtime warn/info logging can disrupt the REPL/TUI. Update both logging sites to debug! while keeping the same error and capability context so the internal best-effort failure reporting remains available without corrupting terminal UI.Source: Path instructions
306-333: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftRoute persistent-approval side effects through the approval layer.
This service now calls
PersistentApprovalPolicyStore::allow()andToolPermissionOverrideStore::clear()directly fromironclaw_product_workflow. That bypasses the approval boundary and leaves the durable allow + override-clear sequence owned by product code instead of the approval layer that already owns resolution semantics. Move this behindApprovalResolutionPort(or a dedicated approval-layer port) and invoke that here. As per path instructions,Approve/deny decisions must go through ApprovalResolutionPort and TurnCoordinator; product/WebUI code must not directly execute tools or mutate approval stores ad hoc.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around lines 306 - 333, Route the durable approval side effects out of approval_interaction::service::persist_allow_policy and behind the approval boundary instead of calling PersistentApprovalPolicyStore::allow() and ToolPermissionOverrideStore::clear() directly. Introduce or use an ApprovalResolutionPort-owned method that performs the allow-then-clear sequence, and have persist_allow_policy delegate to that port so the approval layer remains the single owner of resolution semantics and store mutation.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs`:
- Around line 312-332: These fallback diagnostics in
approval_interaction::service, specifically the tracing calls inside the
persistent_policies.allow and tool_permission_overrides.clear error paths,
should use debug! instead of warn! because runtime warn/info logging can disrupt
the REPL/TUI. Update both logging sites to debug! while keeping the same error
and capability context so the internal best-effort failure reporting remains
available without corrupting terminal UI.
- Around line 306-333: Route the durable approval side effects out of
approval_interaction::service::persist_allow_policy and behind the approval
boundary instead of calling PersistentApprovalPolicyStore::allow() and
ToolPermissionOverrideStore::clear() directly. Introduce or use an
ApprovalResolutionPort-owned method that performs the allow-then-clear sequence,
and have persist_allow_policy delegate to that port so the approval layer
remains the single owner of resolution semantics and store mutation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: a098d8a1-20cb-419f-920f-fe4f83865273
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock,!**/Cargo.lock
📒 Files selected for processing (5)
crates/ironclaw_filesystem/CLAUDE.mdcrates/ironclaw_product_workflow/src/approval_interaction/service.rscrates/ironclaw_turns/src/filesystem_store.rscrates/ironclaw_turns/src/filesystem_store/tests.rscrates/ironclaw_turns/tests/filesystem_turn_state_contract.rs
💤 Files with no reviewable changes (3)
- crates/ironclaw_turns/src/filesystem_store/tests.rs
- crates/ironclaw_turns/tests/filesystem_turn_state_contract.rs
- crates/ironclaw_turns/src/filesystem_store.rs
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Replace per-record filesystem mutexes with a shared CAS update helper across Reborn stores to eliminate lock convoys and related leaks.
Stats: 6 findings kept (from 8 raw, 6 after filtering/dedup) across 3 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 1. Filtered: 2 low-signal local-pattern doc notes.
All surviving findings are Medium severity, so I am leaving this as a comment review rather than requesting changes.
Security
- Medium Discarded approvals now keep the full request payload on disk (
crates/ironclaw_run_state/src/lib.rs:947-956, confidence 88) — anchor: crates/ironclaw_run_state/src/lib.rs:947
discard_pendingnow rewrites the filesystem record asDiscardedinstead of deleting it, but it keeps the full originalApprovalRequestbody in the tombstone. That means a cancelled approval's action, reason, requester, and fingerprint remain recoverable from disk/backups even though public APIs hide discarded records. This is a data-minimization regression on an approval control-plane record.
Fix: Store only the minimal tombstone data needed to reserve the request ID, or keep the discarded-ID reservation in a compact side index instead of retaining the full approval payload.
Tests
- Medium Missing test for op-time unsupported-write fail-closed path (
crates/ironclaw_filesystem/src/cas.rs:337-340, confidence 95) — anchor: crates/ironclaw_filesystem/src/cas.rs:337
The helper covers the pre-flight non-CAS rejection, but the second fail-closed path is untested: an unknown/default-capability backend can read successfully and then returnFilesystemError::Unsupported { operation: WriteFile }fromput. That branch maps toCasUpdateError::CasUnsupportedand should be protected because it is the composite-router fallback path.
Fix: Addtests::cas::unsupported_write_file_maps_to_cas_unsupportedusing a default-capabilities backend whose read succeeds and whose CAS write returnsUnsupported(WriteFile).
Performance / Concurrency
- Medium CAS helper clones the full snapshot on every retry (
crates/ironclaw_filesystem/src/cas.rs:299-307, confidence 86) — anchor: crates/ironclaw_filesystem/src/cas.rs:306
cas_updateclones the decoded snapshot before everyapply, and several migrated call sites clone again when buildingCasApply::new(...). For large snapshots such as turn state, each write now copies the whole record at least twice, and contention multiplies that cost across the retry loop. That is avoidable churn on the hot persistence path this PR is trying to improve.
Fix: Passapplya borrowed snapshot or split the helper API so the helper retains one owned copy for equality/write checks while call sites clone only when they truly need ownership.
Conventions
- Medium Filesystem discard now blocks ID reuse, but the in-memory store does not (
crates/ironclaw_run_state/src/lib.rs:947-956, confidence 82) — anchor: crates/ironclaw_run_state/tests/run_state_contract.rs:459 (no diff position — body only)
The filesystem store now writes aDiscardedtombstone, so a latersave_pendingwith the same request ID is rejected even thoughgetandrecords_for_scopehide the record. The in-memory store still removes the record outright, so tests and early wiring can accept a reused approval ID that production rejects. That diverges the store contract introduced by this PR.
Fix: MakeInMemoryApprovalRequestStore::discard_pendingpreserve a discarded tombstone, or separately track discarded IDs sosave_pendingrejects reused IDs in both backends.
Maintainability
- Medium Turn-state apply has become a 1k-line orchestration knot (
crates/ironclaw_turns/src/filesystem_store.rs:381-533, confidence 95) — anchor: crates/ironclaw_turns/src/filesystem_store.rs:381
This refactor pushesFilesystemTurnStateStorefrom 961 to 1051 lines, and the newapplyflow now mixes cache invalidation, runner-lease overlay, transient store construction, FnMut adaptation, CAS no-op detection, timeout policy, and cache update/rollback in one method. That is a lot of state to hold in one control flow, especially in the turn-state persistence path.
Fix: Split the CAS-specific orchestration into smaller helpers, or move runner-lease overlay/cache bookkeeping behind a focused helper soapplycoordinates the transition at a higher level. - Medium The apply timeout is now defined in two places (
crates/ironclaw_turns/src/filesystem_store.rs:77-77, confidence 86) — anchor: crates/ironclaw_turns/src/filesystem_store.rs:77
The turn store keeps its own 15-secondFILESYSTEM_APPLY_TIMEOUT, while the shared CAS helper now owns the same 15-second loop timeout. The outer timeout comment says this may be shorter for tests, but the default value is still duplicated, so a future timeout-policy change can drift the turn store away from the helper contract.
Fix: Use the sharedironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUTas the turn-store default, or remove the outer default when the helper deadline is the only intended policy.
| status: record.status, | ||
| }); | ||
| } | ||
| // Write a Discarded tombstone so the file still exists (preventing |
There was a problem hiding this comment.
Medium — Discarded approvals now keep the full request payload on disk.
discard_pending now rewrites the filesystem record as Discarded instead of deleting it, but it keeps the full original ApprovalRequest body in the tombstone. That means a cancelled approval's action, reason, requester, and fingerprint remain recoverable from disk/backups even though public APIs hide discarded records. This is a data-minimization regression on an approval control-plane record.
Fix: Store only the minimal tombstone data needed to reserve the request ID, or keep the discarded-ID reservation in a compact side index instead of retaining the full approval payload.
There was a problem hiding this comment.
Deferred (consistency, not a discard-specific bug). ApprovalRecord.request is non-optional and Approved/Denied records already retain the identical full payload on disk — the Discarded tombstone is consistent with every other terminal state. Singling out discard for minimization would be the inconsistency. If approval-payload retention matters it's a cross-cutting policy (pruning/redaction of all terminal approval records) plus a schema change (optional/redacted request), which is out of scope for this PR. Tracking as a retention follow-up.
| cas_retry_backoff(attempt).await; | ||
| } | ||
| // 5b. Backend cannot CAS-write — fail closed (no blind overwrite). | ||
| Err(FilesystemError::Unsupported { |
There was a problem hiding this comment.
Medium — Missing test for op-time unsupported-write fail-closed path.
The helper covers the pre-flight non-CAS rejection, but the second fail-closed path is untested: an unknown/default-capability backend can read successfully and then return FilesystemError::Unsupported { operation: WriteFile } from put. That branch maps to CasUpdateError::CasUnsupported and should be protected because it is the composite-router fallback path.
Fix: Add tests::cas::unsupported_write_file_maps_to_cas_unsupported using a default-capabilities backend whose read succeeds and whose CAS write returns Unsupported(WriteFile).
There was a problem hiding this comment.
Fixed in 578e971 — added unsupported_write_file_maps_to_cas_unsupported: a default/unknown-capability backend whose get succeeds and whose put returns Unsupported{WriteFile}, asserting cas_update maps it to CasUpdateError::CasUnsupported. This exercises the op-time fallback path, distinct from the already-tested pre-flight rejection.
| snapshot, | ||
| outcome, | ||
| write, | ||
| } = apply(current.clone()) |
There was a problem hiding this comment.
Medium — CAS helper clones the full snapshot on every retry.
cas_update clones the decoded snapshot before every apply, and several migrated call sites clone again when building CasApply::new(...). For large snapshots such as turn state, each write now copies the whole record at least twice, and contention multiplies that cost across the retry loop. That is avoidable churn on the hot persistence path this PR is trying to improve.
Fix: Pass apply a borrowed snapshot or split the helper API so the helper retains one owned copy for equality/write checks while call sites clone only when they truly need ownership.
There was a problem hiding this comment.
Deferred (low-value perf). Confirmed two in-memory clones per attempt: cas_update clones current (needed — retained for the equality check after apply consumes its copy) and the turns call site clones new_snapshot (threaded into outcome to populate the read cache). Both are struct copies, not IO, and the loop only re-clones on VersionMismatch (rare). Eliminating them requires an API change (return the final snapshot from cas_update, or hash-based equality) for marginal gain — premature without a profile showing turn-state writes are hot. Will revisit with a benchmark.
| // there is no lock contention in practice. | ||
| let apply = Arc::new(Mutex::new(apply)); | ||
|
|
||
| let cas_future = cas_update( |
There was a problem hiding this comment.
Medium — Turn-state apply has become a 1k-line orchestration knot.
This refactor pushes FilesystemTurnStateStore from 961 to 1051 lines, and the new apply flow now mixes cache invalidation, runner-lease overlay, transient store construction, FnMut adaptation, CAS no-op detection, timeout policy, and cache update/rollback in one method. That is a lot of state to hold in one control flow, especially in the turn-state persistence path.
Fix: Split the CAS-specific orchestration into smaller helpers, or move runner-lease overlay/cache bookkeeping behind a focused helper so apply coordinates the transition at a higher level.
There was a problem hiding this comment.
Agreed it's doing a lot, deferred to #5274. The clean move is extracting the per-retry transform (overlay + store build + no-op detection) into a named helper so apply reads as orchestration. Follow-up #5274 (migrate the runner-lease sidecar onto cas_update) already rewrites this exact apply/CAS path, so the decomposition lands there — once — rather than re-churning a just-merged file twice. Added to #5274's scope. (Note: most of the 961→1051 growth is main's own module split landing via the merge; net the file is smaller than this branch's pre-merge 1455.)
| /// unrelated callers behind one wedged operation. | ||
| pub const FILESYSTEM_CAS_RETRIES: usize = 32; | ||
| /// Deadline for the entire read-modify-write loop (including all retries). | ||
| pub const FILESYSTEM_APPLY_TIMEOUT: Duration = Duration::from_secs(15); |
There was a problem hiding this comment.
Medium — The apply timeout is now defined in two places.
The turn store keeps its own 15-second FILESYSTEM_APPLY_TIMEOUT, while the shared CAS helper now owns the same 15-second loop timeout. The outer timeout comment says this may be shorter for tests, but the default value is still duplicated, so a future timeout-policy change can drift the turn store away from the helper contract.
Fix: Use the shared ironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUT as the turn-store default, or remove the outer default when the helper deadline is the only intended policy.
There was a problem hiding this comment.
Fixed in 578e971 — the turn store now uses the shared ironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUT instead of a duplicated local constant; the per-store override via with_apply_timeout (used in tests) is unchanged.
- cas/tests.rs: add `unsupported_write_file_maps_to_cas_unsupported` covering
the op-time fail-closed path (default-capability backend reads OK, then `put`
returns Unsupported{WriteFile} → CasUpdateError::CasUnsupported) — previously
only the pre-flight non-CAS rejection was tested.
- run_state tests: replace the sequential discard regression with a
deterministic TOCTOU harness (RootFilesystem decorator injects a racing
approve() during discard's read→write window) so the CAS-retry-after-version-
bump path is actually exercised; discard loses the race and returns
ApprovalNotPending without clobbering the resolved record.
- turns/filesystem_store.rs: use the shared ironclaw_filesystem::
FILESYSTEM_APPLY_TIMEOUT instead of a duplicated local 15s constant.
- docs/plans: quality gate uses check-mode `cargo fmt --all -- --check`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/plans/2026-06-25-cas-migration.md (1)
141-146: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winAdd
cargo buildto the quality gate.Line 141 updates the checklist, but the repo’s required pre-PR validation still includes
cargo build. Omitting it leaves this migration plan weaker than the documented submission gate.As per coding guidelines, "Before You Open a PR" requires
cargo build.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/2026-06-25-cas-migration.md` around lines 141 - 146, The quality gate checklist is missing the required build validation, so update the plan’s validation list to include cargo build alongside the existing cargo fmt/check/clippy/test commands. Make the change in the checklist section that enumerates the verification steps so the migration plan matches the repo’s pre-PR submission gate.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 511-582: The test only checks the საბოლოary
CasUpdateError::CasUnsupported, so it can still pass even if cas_update fails
earlier in the pre-flight capability check and never reaches
UnsupportedWriteBackend::put. Update the UnsupportedWriteBackend test helper to
record when put is invoked, and in
unsupported_write_file_maps_to_cas_unsupported assert both that cas_update
returns CasUpdateError::CasUnsupported and that put was actually called, so the
op-time fallback path is proven.
---
Outside diff comments:
In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 141-146: The quality gate checklist is missing the required build
validation, so update the plan’s validation list to include cargo build
alongside the existing cargo fmt/check/clippy/test commands. Make the change in
the checklist section that enumerates the verification steps so the migration
plan matches the repo’s pre-PR submission gate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 947d1e05-ffb2-4cfc-ac10-331d3eba110a
📒 Files selected for processing (5)
crates/ironclaw_filesystem/src/cas/tests.rscrates/ironclaw_product_workflow/src/approval_interaction/service.rscrates/ironclaw_run_state/tests/run_state_contract.rscrates/ironclaw_turns/src/filesystem_store.rsdocs/plans/2026-06-25-cas-migration.md
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Replace per-record filesystem mutex convoys with a shared CAS update helper and migrate all Reborn stores to it.
Stats: 4 findings kept (from 5 raw, 4 after filtering/dedup) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.
All surviving findings are Medium/Low, so this is a comment review rather than a request for changes. I did not repost prior threads that already have fixed replies or sound deferrals; the turn-state apply-structure concern is covered by the existing #5274 deferral.
Tests
- Medium cas_update lacks a generic backend-error regression test (
crates/ironclaw_filesystem/src/cas.rs:330-341, confidence 78) — anchor: crates/ironclaw_filesystem/src/cas.rs:341
The helper has coverage for first write, no-op, contention, timeout, and unsupported-write, but not the fallbackErr(error) => CasUpdateError::Backend(error)branch on a normal write failure. A backend returning a non-UnsupportedFilesystemErrorfromputwould still be unpinned here.
Fix: Addtests::cas::backend_put_error_maps_to_backendcovering a successful read followed by a generic backend error fromput. - Low Discarded approval status is untested (
crates/ironclaw_capabilities/src/helpers.rs:273-279, confidence 84) — anchor: crates/ironclaw_capabilities/src/helpers.rs:273
The newApprovalStatus::Discardedarm has no direct regression test. Nothing asserts that discarded approvals map toApprovalDiscarded, so this one-line branch could drift silently.
Fix: Addtests::helpers::approval_not_approved_error_kind_maps_discarded_statuscoveringApprovalStatus::Discarded -> "ApprovalDiscarded".
Performance
- Low Last retry still pays a backoff sleep (
crates/ironclaw_filesystem/src/cas.rs:333-334, confidence 89) — anchor: crates/ironclaw_filesystem/src/cas.rs:333
The loop always awaitscas_retry_backoff(attempt)onVersionMismatch, including the final permitted attempt. That adds an avoidable 2-50ms of tail latency to every exhausted CAS call on the shared hot path, with no chance of success because no retry remains.
Fix: Skip the sleep whenattempt + 1 == FILESYSTEM_CAS_RETRIESand returnRetriesExhaustedimmediately.
Local Patterns
- Low Fix the equality-path example in the public API docs (
crates/ironclaw_filesystem/src/cas.rs:105-108, confidence 91) — anchor: crates/ironclaw_filesystem/src/cas.rs:105
The publicCasApplydocs describe the unchanged-snapshot fast path asSome(existing) == snapshot, butsnapshotis anSwhilecurrentis theOption<S>. As written, the example is type-mismatched and makes the two no-op modes harder to distinguish for store authors reading the API surface.
Fix: Rewrite the sentence in terms of the actual comparison, for examplecurrentisSome(existing)and the returned snapshot equalsexisting.
| operation: FilesystemOperation::WriteFile, | ||
| .. | ||
| }) => return Err(CasUpdateError::CasUnsupported), | ||
| Err(error) => return Err(CasUpdateError::Backend(error)), |
There was a problem hiding this comment.
Medium — cas_update lacks a generic backend-error regression test.
The helper has coverage for first write, no-op, contention, timeout, and unsupported-write, but not the fallback Err(error) => CasUpdateError::Backend(error) branch on a normal write failure. A backend returning a non-Unsupported FilesystemError from put would still be unpinned here.
Fix: Add tests::cas::backend_put_error_maps_to_backend covering a successful read followed by a generic backend error from put.
There was a problem hiding this comment.
Fixed in 4b061dd — added backend_put_error_maps_to_backend: a CAS-capable backend whose get succeeds and whose put returns a generic (non-VersionMismatch, non-Unsupported{WriteFile}) FilesystemError, asserting cas_update maps it to CasUpdateError::Backend(_). Covers the catch-all arm at cas.rs:341.
| Ok(_) => return Ok(outcome), | ||
| // 5a. Lost the CAS race — re-read and retry with backoff. | ||
| Err(FilesystemError::VersionMismatch { .. }) => { | ||
| cas_retry_backoff(attempt).await; |
There was a problem hiding this comment.
Low — Last retry still pays a backoff sleep.
The loop always awaits cas_retry_backoff(attempt) on VersionMismatch, including the final permitted attempt. That adds an avoidable 2-50ms of tail latency to every exhausted CAS call on the shared hot path, with no chance of success because no retry remains.
Fix: Skip the sleep when attempt + 1 == FILESYSTEM_CAS_RETRIES and return RetriesExhausted immediately.
There was a problem hiding this comment.
Fixed in 4b061dd — the VersionMismatch arm now guards the sleep with if attempt + 1 < FILESYSTEM_CAS_RETRIES, so the final exhausted attempt returns RetriesExhausted immediately without the 2-50ms backoff. Retry count unchanged.
| ApprovalStatus::Approved => "ApprovalApproved", | ||
| ApprovalStatus::Denied => "ApprovalDenied", | ||
| ApprovalStatus::Expired => "ApprovalExpired", | ||
| ApprovalStatus::Discarded => "ApprovalDiscarded", |
There was a problem hiding this comment.
Low — Discarded approval status is untested.
The new ApprovalStatus::Discarded arm has no direct regression test. Nothing asserts that discarded approvals map to ApprovalDiscarded, so this one-line branch could drift silently.
Fix: Add tests::helpers::approval_not_approved_error_kind_maps_discarded_status covering ApprovalStatus::Discarded -> "ApprovalDiscarded".
There was a problem hiding this comment.
Fixed in 4b061dd — added approval_not_approved_error_kind_maps_discarded_status asserting ApprovalStatus::Discarded -> "ApprovalDiscarded".
| /// Two no-op signals are available, covering the two cases where `apply` should | ||
| /// skip the write entirely: | ||
| /// | ||
| /// 1. **Existing record, unchanged snapshot** — return `CasApply::new` with |
There was a problem hiding this comment.
Low — Fix the equality-path example in the public API docs.
The public CasApply docs describe the unchanged-snapshot fast path as Some(existing) == snapshot, but snapshot is an S while current is the Option<S>. As written, the example is type-mismatched and makes the two no-op modes harder to distinguish for store authors reading the API surface.
Fix: Rewrite the sentence in terms of the actual comparison, for example current is Some(existing) and the returned snapshot equals existing.
There was a problem hiding this comment.
Fixed in 4b061dd — the CasApply doc example no longer mismatches types; it now describes the actual check (current is Some(existing) and the returned snapshot equals existing), matching matches!(¤t, Some(e) if *e == snapshot) and distinguishing it from the explicit CasApply::no_op path.
- cas/tests.rs: assert `UnsupportedWriteBackend::put` was actually invoked in `unsupported_write_file_maps_to_cas_unsupported` (proves the op-time fallback path, not just the final enum which pre-flight also yields). - cas/tests.rs: add `backend_put_error_maps_to_backend` covering the generic `put` failure → `CasUpdateError::Backend` catch-all arm. - cas.rs: skip the jittered backoff sleep on the final CAS attempt (`attempt + 1 < FILESYSTEM_CAS_RETRIES`) — no retry remains, so it was pure tail latency on the shared hot path. - cas.rs: fix the type-mismatched equality example in the `CasApply` docs (`current` is `Some(existing)` and the returned snapshot equals `existing`). - capabilities/helpers.rs: regression test for `ApprovalStatus::Discarded -> "ApprovalDiscarded"`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Remove per-record filesystem mutex convoys by migrating Reborn filesystem-backed stores to a shared CAS update helper.
Stats: 4 findings (from 8 reviewer lanes; 1 High, 3 Medium) across 1 file. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.
Prior-thread context was considered. I did not repost the already-addressed items or the soundly deferred follow-ups around Send bound creep, BackendCapabilities::default() as an explicit-design cleanup, approval retention/tombstone policy, clone-cost benchmarking, or turn-state decomposition in #5274.
Bugs
- High Byte-only backends still slip through on first write (
crates/ironclaw_filesystem/src/cas.rs:256-347, confidence 88) — anchor:crates/ironclaw_filesystem/src/cas.rs:260
The fail-closed CAS gate can bypass default-capability composite roots and then successfully first-write JSON byte snapshots on local/byte-only mounts viaCasExpectation::Absent.
Tests
- Medium Missing test for backend read failure propagation (
crates/ironclaw_filesystem/src/cas.rs:292-298, confidence 93) — anchor:crates/ironclaw_filesystem/src/cas.rs:298
The helper'sgeterror branch should be pinned asCasUpdateError::Backend. - Medium Missing test for malformed snapshot decode failure (
crates/ironclaw_filesystem/src/cas.rs:293-295, confidence 98) — anchor:crates/ironclaw_filesystem/src/cas.rs:294
Malformed persisted bytes should be pinned asCasUpdateError::Applyat the helper boundary. - Medium Missing test for encode failure propagation (
crates/ironclaw_filesystem/src/cas.rs:325-327, confidence 98) — anchor:crates/ironclaw_filesystem/src/cas.rs:327
Encoder failures should be pinned asCasUpdateError::Applywithout writing.
| // *known* shape; the composite router advertises the empty default and | ||
| // defers to the op-time check below. Either way we never fall back to | ||
| // `CasExpectation::Any`. | ||
| let capabilities = filesystem.capabilities(); |
There was a problem hiding this comment.
High — Byte-only backends still slip through on first write.
The fail-closed gate only rejects backends that advertise a non-default capability shape or that return Unsupported(WriteFile) from put. A composite filesystem reports BackendCapabilities::default() before path resolution, then routes the write to the mounted backend. LocalFilesystem accepts CasExpectation::Absent for plain byte entries, and CAS snapshots are JSON byte entries, so the first snapshot write on a local/byte-only mount can succeed instead of surfacing CasUnsupported. Later Version(...) writes fail, but the helper has already violated its no-CAS fail-closed contract for fresh records.
Fix: Do not infer CAS support from the root composite capabilities alone. Gate against path-resolved mount capabilities, or reject unknown/default roots before issuing any write so Absent writes cannot succeed on non-CAS mounts.
There was a problem hiding this comment.
Fixed in 0d4bc1c — made all CAS-store encoders record-shaped (entry.kind = Some(RecordKind)), so LocalFilesystem's existing record-shaped rejection fires fail-closed uniformly on the Absent first-write too (turns was already protected; run_state/secrets/resources/threads were not). Self-enforcing at the data-model level, no new helper branches. Severity is really Medium — Absent is create-if-absent and all updates fail-closed, so there is no blind overwrite / lost update; only creation succeeded then updates failed loudly. New record_shaped_first_write_fails_closed_on_byte_only_backend + a store-level fail-closed test over a real byte-only LocalFilesystem mount; verified across the postgres/libsql/all-features matrix.
| (Some(decoded), Some(versioned.version)) | ||
| } | ||
| Ok(None) => (None, None), | ||
| Err(error) => return Err(CasUpdateError::Backend(error)), |
There was a problem hiding this comment.
Medium — Missing test for backend read failure propagation.
The filesystem.get(...).await error arm now returns CasUpdateError::Backend, but the new CAS helper tests never inject a read failure. The suite covers contention, capability gating, timeout, and put failures, but not the branch that should surface a backend failure before a snapshot can be loaded.
Fix: Add tests::cas::get_error_is_returned_as_backend using a backend whose get returns a non-NotFound FilesystemError and assert cas_update returns CasUpdateError::Backend(_).
There was a problem hiding this comment.
Fixed in 0d4bc1c — added backend_get_error_maps_to_backend pinning a get failure to CasUpdateError::Backend.
| // committed version, never a torn one. | ||
| let (current, version) = match filesystem.get(scope, path).await { | ||
| Ok(Some(versioned)) => { | ||
| let decoded = decode(&versioned.entry.body).map_err(CasUpdateError::Apply)?; |
There was a problem hiding this comment.
Medium — Missing test for malformed snapshot decode failure.
The decode(&versioned.entry.body) failure path maps to CasUpdateError::Apply, but no test drives cas_update with invalid bytes in an existing record. That helper boundary should be pinned directly because malformed persisted snapshots are a distinct failure mode from apply-closure errors.
Fix: Add tests::cas::decode_error_is_returned_as_apply with an existing record containing malformed JSON and assert the decode error is returned through CasUpdateError::Apply.
There was a problem hiding this comment.
Fixed in 0d4bc1c — added malformed_snapshot_decode_maps_to_apply pinning a decode failure to CasUpdateError::Apply.
|
|
||
| // 4. Encode + CAS put. Absent → create-if-absent; existing → version | ||
| // precondition. | ||
| let entry = encode(&snapshot).map_err(CasUpdateError::Apply)?; |
There was a problem hiding this comment.
Medium — Missing test for encode failure propagation.
The encode(&snapshot) failure path also maps to CasUpdateError::Apply, but the current tests only use an encoder that succeeds. A serialization failure would therefore be unpinned at the shared helper boundary.
Fix: Add tests::cas::encode_error_is_returned_as_apply using an encoder that returns TestError after apply produces a changed snapshot, and assert CasUpdateError::Apply is returned without writing.
There was a problem hiding this comment.
Fixed in 0d4bc1c — added encode_failure_maps_to_apply_without_write pinning an encode failure to CasUpdateError::Apply and asserting no put was attempted.
… tests Multi-agent review (High): cas_update's fail-closed guarantee had a gap on the first (absent) write. A default/unknown-capability backend skips the pre-flight gate, and LocalFilesystem.put rejects only record-shaped entries (entry.kind.is_some()) — so a BYTE-ONLY entry written with CasExpectation::Absent was accepted, slipping past the op-time fail-closed check. ironclaw_turns already encoded record-shaped entries (kind set) and was protected; run_state, secrets, resources, and threads encoded byte-only entries and were not. Fix (make the contract self-enforcing at the data-model level, not mount-config dependent): set entry.kind on every CAS-store encoder so they are record-shaped like turns. LocalFilesystem's existing record-shaped rejection then makes the op-time fail-closed fire uniformly on the Absent path too — no new branches in the shared helper. Reads are unaffected (body decode ignores kind); additive vs existing kind=None rows. - run_state: run-record + approval-record encoders set kind; store-level fail-closed regression test over a byte-only LocalFilesystem mount. - secrets: secret / lease / account / session encoders set kind. - resources: Snapshot trait gains RECORD_KIND; cas_snapshot encoder stamps it; ResourceGovernor + BudgetGate snapshots declare their kind. - threads: session / message / summary / idempotency encoders set kind. - cas/tests.rs: `record_shaped_first_write_fails_closed_on_byte_only_backend` proves the mechanism end-to-end (kind-gated put → CasUnsupported on first write), plus error-propagation tests: get-failure → Backend, decode-failure → Apply, encode-failure → Apply (no write). Note (kind identifiers): RecordKind validates as `[A-Za-z_][A-Za-z0-9_]*`, so snake_case kinds (matching turns' `turn_state_snapshot`), not kebab-case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the latest multi-agent review (1 High + 3 Medium tests) in 0d4bc1c. High — byte-only backends slip through on first write. Valid, and a touch broader than flagged: Medium ×3 — missing error-path tests. Added: read failure → |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/ironclaw_threads/src/filesystem_service.rs (1)
1118-1125: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winValidate the stored thread ID before returning an existing record.
Line 1120 only scope-checks
existing; if the path contains a same-scope record for a differentthread_id,ensure_threadreturns the wrong thread as successfully ensured. Matchread_thread_versioned’s identity check and reject mismatched IDs.Proposed fix
- if existing.record.scope != scope { + if existing.record.scope != scope || existing.record.thread_id != thread_id { Err(SessionThreadError::ThreadScopeMismatch { thread_id }) } else {As per coding guidelines, “Preserve tenant/user/agent/project/mission/thread scope on authority, state, memory, process, network, outbound, resource, and event records.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ironclaw_threads/src/filesystem_service.rs` around lines 1118 - 1125, The existing-return path in ensure_thread only checks scope and can incorrectly accept a record whose stored thread_id does not match the requested thread_id. Update the Some(existing) branch in filesystem_service::ensure_thread to also validate existing.record.thread_id against the requested thread_id, matching the identity check used by read_thread_versioned, and return a thread-mismatch error when the IDs differ before returning CasApply::new(existing.clone(), existing.record).Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 975-995: The byte-only mock in KindGatedByteOnlyBackend::put only
verifies the entry shape and ignores the cas argument, so the test still passes
even if cas_update stops passing CasExpectation::Absent for first writes. Update
the mock to capture and assert the CAS precondition alongside path and entry,
and ensure the test exercises the absent-write case through cas_update so every
argument the production caller passes is validated.
---
Outside diff comments:
In `@crates/ironclaw_threads/src/filesystem_service.rs`:
- Around line 1118-1125: The existing-return path in ensure_thread only checks
scope and can incorrectly accept a record whose stored thread_id does not match
the requested thread_id. Update the Some(existing) branch in
filesystem_service::ensure_thread to also validate existing.record.thread_id
against the requested thread_id, matching the identity check used by
read_thread_versioned, and return a thread-mismatch error when the IDs differ
before returning CasApply::new(existing.clone(), existing.record).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 7c6c776b-3c08-4c47-8ace-e275948a1864
📒 Files selected for processing (8)
crates/ironclaw_filesystem/src/cas/tests.rscrates/ironclaw_resources/src/cas_snapshot.rscrates/ironclaw_resources/src/filesystem_store.rscrates/ironclaw_resources/src/lib.rscrates/ironclaw_run_state/src/lib.rscrates/ironclaw_run_state/tests/run_state_contract.rscrates/ironclaw_secrets/src/filesystem_store.rscrates/ironclaw_threads/src/filesystem_service.rs
- cas/tests.rs: the byte-only mock's `put` now captures `cas` and asserts `CasExpectation::Absent`, so the fail-closed regression also pins that cas_update uses Absent for the first (absent) write — not just the record-shaped op-time rejection. - threads/filesystem_service.rs: ensure_thread's existing-return branch now rejects a thread_id mismatch as well as a scope mismatch, matching read_thread_versioned (line 236) and the message-read guard. The thread path is thread_id-keyed so this is defense-in-depth parity rather than a reachable bug via the public API; bringing it in line with the canonical read path removes the inconsistency a reviewer flagged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Re the outside-diff Major on |
henrypark133
left a comment
There was a problem hiding this comment.
Code Review (multi-agent)
Intent: Remove per-record filesystem lock convoys by migrating Reborn persistence stores to a shared CAS update helper.
Stats: 2 findings kept (from 6 raw final-head reviewer findings, 2 after live-thread/stale filtering) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 1.
Prior-thread context was considered. I filtered out the repeated CAS helper findings for byte-only backends and get/decode/encode error-path tests because the live threads already mark them fixed on the current series and the final head contains the corresponding tests.
Tests
- Medium Cover consuming an already-expired lease (
crates/ironclaw_secrets/src/filesystem_store.rs:498-504, confidence 86) — anchor:crates/ironclaw_secrets/src/filesystem_store.rs:498
The newSecretLeaseStatus::Expiredbranch inFilesystemSecretStore::consumereturnsLeaseExpireddirectly when the lease record is already markedExpired, but there is no regression test for that already-expired path. A zero-TTL or pre-aged lease could start rewriting state or surfacing the wrong error without a failing test.
Local Patterns
- Low Drop the stale per-path lock description from the CAS snapshot docs (
crates/ironclaw_resources/src/cas_snapshot.rs:8-10, confidence 95) — anchor:crates/ironclaw_resources/src/cas_snapshot.rs:8(no diff position — body only)
The module docs still say the shared snapshot stores use an in-process per-path async lock that serializes same-process writers, but this module now routes updates throughCasSnapshotStoreand the shared lock-freecas_updatehelper. That wording is now inaccurate and can mislead future maintainers about the actual concurrency model, especially when adding new stores.
| SecretLeaseStatus::Revoked => { | ||
| Err(SecretStoreError::LeaseRevoked { lease_id }) | ||
| } | ||
| SecretLeaseStatus::Expired => { |
There was a problem hiding this comment.
Medium — Cover consuming an already-expired lease.
The new SecretLeaseStatus::Expired branch in FilesystemSecretStore::consume returns LeaseExpired directly when the lease record is already marked Expired, but there is no regression test for that already-expired path. A zero-TTL or pre-aged lease could start rewriting state or surfacing the wrong error without a failing test.
Fix: Add a filesystem secret-store test that creates or marks an already-expired lease, calls consume, and asserts LeaseExpired is returned without issuing another write.
There was a problem hiding this comment.
Fixed in 805bd15 — added consume_already_expired_lease_returns_expired_without_rewrite. It forces a genuine stored Expired state (lease TTL = -1s → first consume promotes Active→Expired with one write → second consume hits the already_marked branch), asserts Err(SecretStoreError::LeaseExpired), and captures the InMemoryBackend record version before/after to prove the already-marked path issues NO second write (version unchanged).
… LocalFilesystem CI regression from making the CAS-store encoders record-shaped (0d4bc1c): LocalFilesystem.put rejects record-shaped entries (entry.kind.is_some()) with Unsupported{WriteFile}, which cas_update maps to fail-closed CasUnsupported. Several reborn test harnesses backed run-state / approval / session-thread stores with a byte-only LocalFilesystem — a backend production never uses for these stores (the filesystem guardrail states all production store mounts resolve to CAS-capable db/in-memory backends; LocalFilesystem is structurally unreachable). So the failures are the intended fail-closed contract correctly rejecting a non-production backend, not a code regression. Fix the harnesses, not the code: route the CAS stores through a shared Arc<InMemoryBackend> (full BackendCapabilities, CAS-capable) which models the production database-backed filesystem. The durable-restart test keeps its three service-graph instances over one shared backend, preserving the state-survives-restart semantics; LocalFilesystem still backs the byte-only capability-lease / process services (which don't set RecordKind). No assertions weakened, no tests ignored. - crates/ironclaw_host_runtime/tests/reborn_durable_restart_integration.rs - tests/support/reborn/session_thread.rs - tests/support/reborn/harness.rs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stale CAS doc - secrets: add `consume_already_expired_lease_returns_expired_without_rewrite`. The `SecretLeaseStatus::Expired` already-marked branch of `consume` returns LeaseExpired via the apply-error path with no write; the test forces a stored Expired state (TTL=-1 → first consume promotes to Expired → second consume hits the already-marked path) and asserts the backend record version does NOT bump (no second write). - resources/cas_snapshot.rs: module doc no longer describes an in-process per-path async lock — it now routes through the lock-free shared cas_update helper (versioned CAS + retry on VersionMismatch + jittered backoff). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Re the Low (body-only) finding on |
Problem
Each Reborn persistence store wrapped its filesystem read-modify-write in a per-record
tokio::sync::Mutex(FILESYSTEM_RECORD_LOCKS) held across.await— a redundant in-process serializer over backends that already do versioned CAS. Under burst, one writer stalled inside its critical section blocks every other writer for that scope (the convoy that contributed to the runtime wedge). PR #5142 removed exactly this fromironclaw_turns; this PR finishes the job for the remaining stores via one shared helper.Change
Shared helper (
ironclaw_filesystem::cas_update): one mutex-free, bounded read-modify-write loop — read versioned snapshot → idempotentapplyclosure → CAS put → onVersionMismatchre-read and retry (32×, jittered 2–50ms backoff, 15s timeout), with a fail-closed capability gate. Generic over record/outcome/error; leaks no store types.Migrated all 5 stores onto it and deleted every per-record mutex:
ironclaw_turns— re-homed its local fix(turns): prevent turn-state write convoy #5142 copy onto the shared helper (one owner).ironclaw_run_state,ironclaw_threads— dropped the mutex; route through the helper.ironclaw_resources— gains a retry loop it never had (the lock was its only serializer).ironclaw_secrets— deletes theArc-keyed lock map → fixes an unbounded memory leak (it storedArc, never pruned).Post-condition:
grep FILESYSTEM_RECORD_LOCKS / record_lock.lock().await / filesystem_secret_lockacross the 5 store crates is empty.Guardrail:
.claude/rules/database.mdrecords the invariant — filesystem read-modify-write must go throughcas_update; never wrap it in a per-record mutex held across.await.Tri-backend parity (no diverging experience by host)
The helper is
RootFilesystem-level (backend-agnostic). Verified across all threeironclaw_filesystembackends:--features postgres): CAS contract tests pass —postgres_native_put_cas_absent_rejects_existing_path,postgres_native_put_cas_any_increments_existing_version,postgres_transaction_rollback_discards_prior_put_after_later_cas_conflict.--features libsql): CAS contract tests pass. (One unrelated append/dir contract test is flaky under parallel suite execution — passes 3/3 in isolation;db.rsis untouched by this PR, so it is pre-existing test-infra flakiness, not a regression.)Notes
🤖 Generated with Claude Code